QTM 447 Lecture 23: Generative Models: Variational Autoencoders

Kevin McAlister

April 8, 2025

\[ \newcommand\hbb{{\hat{\boldsymbol \beta}}} \newcommand\bb{{\boldsymbol \beta}} \newcommand\expn{{\frac{1}{N} \sum \limits_{i = 1}^N}} \newcommand\sumk{\sum \limits_{k = 1}^K} \newcommand\argminb{\underset{\bb}{\text{argmin }}} \newcommand\argmaxb{\underset{\bb}{\text{argmax }}} \newcommand\gtheta{\mathbf g(\boldsymbol \theta)} \newcommand\htheta{\mathbf H(\boldsymbol \theta)} \]

Generative Models

Goal: Come up with a strategy to learn \(P(\mathbf x)\) given a large set of inputs - \(\mathbf X\)

Success:

  • Density estimation: Given a proposed data point, \(\mathbf x_i\), what is the probability with which we could expect to see that data point? Don’t generate data points that have low probability of occurrence!

  • Sampling: How can we generate novel data from the model distribution? We should be able to sample from the distribution!

  • Representation: Can we learn meaningful feature representations from \(\mathbf x\)? Do we have the ability to exaggerate certain features?

All methods we’ll talk about can be sampled!

  • Differences in the ability to do the other 2

Generative Models

What makes a model generative?

We should be able to provide an answer to the question:

\[ P(\mathbf x | \boldsymbol \Theta) = ? \]

for any viable \(\mathbf x\).

  • A generative model is one where we can answer questions about the structure of \(\mathbf x\)

  • Frequently under assumptions, but still can answer it!

Generative Models

PCA:

\[ \mathbf z = \mathbf W^T \mathbf x \]

\[ \mathbf x = \mathbf W \mathbf z \]

where \(\mathbf W\) is a \(P \times K\) weight matrix with \(K \ll P\).

Optimal solution under squared reconstruction error is \(\mathbf W = \mathbf Q_K\), where \(\mathbf Q_K\) contains the first \(K\) eigenvectors of the covariance matrix

\[ \mathbf X^T \mathbf X = \mathbf Q \mathbf D \mathbf Q^{-1} \]

with \(\mathbf D\) being a diagonal matrix with the eigenvalues sorted from largest to smallest.
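As a concrete sketch of the eigendecomposition route (variable names `K`, `W`, `Z` follow the slides; the data-generating setup is an illustrative assumption):

```python
# PCA as an eigendecomposition, for centered data X of shape (N, P).
import numpy as np

rng = np.random.default_rng(0)
N, P, K = 500, 10, 2

# Build data that truly lives near a K-dimensional subspace plus small noise.
Z_true = rng.normal(size=(N, K))
W_true = rng.normal(size=(P, K))
X = Z_true @ W_true.T + 0.01 * rng.normal(size=(N, P))
X = X - X.mean(axis=0)  # center

# eigh returns eigenvalues in ascending order, so the eigenvectors
# belonging to the K largest eigenvalues are the LAST K columns.
evals, Q = np.linalg.eigh(X.T @ X / N)
W = Q[:, -K:]           # P x K weight matrix

Z = X @ W               # encode: z = W^T x (row convention)
X_hat = Z @ W.T         # decode: x = W z

recon_err = np.mean((X - X_hat) ** 2)
print(recon_err)        # small: the top-K directions carry almost all variance
```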

Generative Models

Generative goal:

\[ P(\mathbf x | \mathbf W) = \int P(\mathbf x | \mathbf W , \mathbf z) P(\mathbf z) d \mathbf z \]

For PCA, all we know is that \(\mathbf x = \mathbf W \mathbf z\) and that \(\mathbf z = \mathbf W^T \mathbf x\)

DOES NOT COMPUTE!!!

Deterministic Autoencoders

This is referred to as a deterministic bottleneck autoencoder

  • Learn a set of encoder and decoder functions that map \(\mathbf X\) to itself!

  • Restriction: each input instance passes through a low-dimensional bottleneck ( \(K \ll P\) )

  • Can’t just copy the input to the output, so the network needs to set up \(\mathbf Z\) to represent as much of the variation in \(\mathbf X\) as possible

Note that PCA is a special case!

  • Restrict the feedforward layers to have linear activations and be invertible!

No good reason to do this when we have autodiff that can solve for arbitrarily complex models
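A minimal PyTorch sketch of such a bottleneck autoencoder (the layer sizes and ReLU activations are illustrative assumptions, not prescribed by the slides):

```python
# Deterministic bottleneck autoencoder: encoder maps P -> K, decoder maps K -> P.
import torch
import torch.nn as nn

P, K = 784, 16  # input dimension and bottleneck dimension, K << P

class Autoencoder(nn.Module):
    def __init__(self, p=P, k=K, h=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(p, h), nn.ReLU(), nn.Linear(h, k))
        self.decoder = nn.Sequential(nn.Linear(k, h), nn.ReLU(), nn.Linear(h, p))

    def forward(self, x):
        z = self.encoder(x)       # low-dimensional bottleneck representation
        return self.decoder(z)    # reconstruction of the input

model = Autoencoder()
x = torch.randn(32, P)            # a fake minibatch for shape-checking
x_hat = model(x)
loss = nn.functional.mse_loss(x_hat, x)  # squared reconstruction error
```

With linear activations and tied, orthonormal weights this collapses back to PCA.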

Generative Autoencoder

Generative goal:

\[ P(\mathbf x | \Theta) = \int P(\mathbf x | f(\mathbf z), \Theta) P(\mathbf z) d \mathbf z \]

For a deterministic autoencoder, all we learn is \(f(\mathbf z)\) through our neural network backbone!

DOES NOT COMPUTE!!!

Generative Autoencoder

Solution: Assumptions. For centered \(\mathbf x\):

\[ P(\mathbf x | \mathbf z) = \mathcal N_P(\mathbf x | f_\mu(\mathbf z), f_\sigma(\mathbf z)) \]

  • Each \(\mathbf x\) is a random draw from a multivariate normal distribution with moments that are a function of the latent variable

  • Similar to PCA, just saying that we have some uncertainty in the mapping of \(\mathbf z \rightarrow \mathbf x\)

\[ P(\mathbf z) = \mathcal N_K(\mathbf z | 0 , \mathcal I_K) \]

  • Prior to seeing any data, we believe that each \(\mathbf z\) is a random draw from a standard multivariate normal

  • Inconsequential choice since we’re going to learn \(\mathbf z\) anyways

  • \(\mathbf z\) is latent, so we make the structure!

Factor Analysis

Made a little simpler/general:

\[ P(\mathbf x | \mathbf z) = \mathcal P(\mathbf x | f(\mathbf z, \boldsymbol \Theta)) \]

  • \(f(\mathbf z)\) is some mapping of the latent variable to the original feature space

  • \(\boldsymbol \Theta\) is a set of parameters that dictate the mapping

Factor Analysis

Our goal: Find values for \(\boldsymbol \Theta\) that maximize the likelihood with which we would observe our input features given the parameters.

\[ \hat{\boldsymbol \Theta} = \underset{\boldsymbol \Theta}{\text{argmax }} \prod \limits_{i = 1}^N P(\mathbf x_i | \mathbf z_i, \boldsymbol \Theta) \]

But \(\mathbf x\) depends on the values of the latent variables, \(\mathbf z\), so:

\[ \hat{\boldsymbol \Theta} = \underset{\boldsymbol \Theta}{\text{argmax }} \prod \limits_{i = 1}^N \int P (\mathbf x_i | \mathbf W \mathbf z_i , \boldsymbol \Psi) P(\mathbf z_i) d\mathbf z_i \]

  • Integrate over the latent values since we’re learning these

Factor Analysis

Previously, we’ve seen that MLE problems can be made simpler by optimizing the log-likelihood

\[ \hat{\boldsymbol \Theta} = \underset{\boldsymbol \Theta}{\text{argmax }} \sum \limits_{i = 1}^N \log \int P(\mathbf x_i | \mathbf z_i, \boldsymbol \Theta) P(\mathbf z_i) d\mathbf z_i \]

Unfortunately, we can’t push the log inside the integral!

  • \(\log(a + b) \neq \log(a) + \log(b)\)

Finding the derivative of a log-integral is not particularly easy…

  • Not even for autodiff…
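A one-line numeric reminder of why the log can't move inside the sum/integral:

```python
# log(a + b) != log(a) + log(b): the right side is log(a * b) instead.
import math

a, b = 2.0, 3.0
print(math.log(a + b), math.log(a) + math.log(b))  # log(5) vs log(6)
assert not math.isclose(math.log(a + b), math.log(a) + math.log(b))
```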

Factor Analysis

Maximizing this expression is tricky!

What’s holding us back here is that we need to learn \(\mathbf z\) and integrate over it

  • It would be way easier if we knew \(\mathbf z\) beforehand

  • But, \(\mathbf z\) is latent, so we learn it as a function of the data!

Factor Analysis

The Gaussian factor model has many different solution methods:

  • Eigendecomposition (it’s equivalent to PCA with a light twist)

  • MLE via an expectation-maximization routine

  • Bayesian MAP estimation using Gibbs sampling

We’re going to show a different method that will be extendable…

Variational Inference

Goal:

Find \(\boldsymbol \Theta\) such that we maximize the likelihood that we see our data:

\[ \hat{\boldsymbol \Theta} = \underset{\boldsymbol \Theta}{\text{argmax }} \sum \log P(\mathbf x_i | \boldsymbol \Theta) \]

  • Hard to do directly since the integral over \(\mathbf z\) is a pain

Variational Inference

Let’s start with Bayes rule:

\[ P(\mathbf z_i | \mathbf x_i , \boldsymbol \Theta) = \frac{P(\mathbf x_i | \mathbf z_i , \boldsymbol \Theta)P(\mathbf z_i)}{P(\mathbf x_i | \boldsymbol \Theta)} \]

and rearrange to get the quantity that we want:

\[ P(\mathbf x_i | \boldsymbol \Theta) = \frac{P(\mathbf x_i | \mathbf z_i , \boldsymbol \Theta)P(\mathbf z_i)}{P(\mathbf z_i | \mathbf x_i , \boldsymbol \Theta)} = \frac{P(\mathbf x_i , \mathbf z_i | \Theta)}{P(\mathbf z_i | \mathbf x_i , \boldsymbol \Theta)} \]

Things we know/define:

\[ P(\mathbf x_i | \mathbf z_i, \boldsymbol \Theta) \text{ (the likelihood)} \quad ; \quad P(\mathbf z_i) \text{ (the prior)} \]

Things we don’t know:

\[ P(\mathbf x_i | \boldsymbol \Theta) \text{ (the marginal likelihood)} \quad ; \quad P(\mathbf z_i | \mathbf x_i , \boldsymbol \Theta) \text{ (the posterior over the latent } \mathbf z \text{)} \]

Variational Inference

Another way to view these terms:

\[ P(\mathbf x_i | \mathbf z_i , \boldsymbol \Theta) \]

is a decoder - translate a latent vector of length \(K\) to the input space

\[ P(\mathbf z_i | \mathbf x_i , \boldsymbol \Theta) \]

is an encoder - translate an input to the latent space

Variational Inference

The encoder and decoder are probabilistic!

  • Input goes in and maps to a distribution in the latent space

  • Latent value is a distribution and maps out to another distribution in the input space

Variational Inference

Can’t do anything with this until we figure out how to find \(\boldsymbol \Theta\) that maximizes the marginal likelihood

Right now, our sticking point is the encoder:

\[ P(\mathbf z_i | \mathbf x_i , \boldsymbol \Theta) \]

  • We don’t know what this is!

  • We only have a prior on \(\mathbf z\)

Variational Inference

Bayesian approaches to simplifying the posterior:

  • Analytical solution: Works in the base case, but not extendable to the autoencoder case

  • MAP approximation: Again, works in the base case, but MVNs are too simple for complicated structures

Third method: find an approximate posterior that is as close as possible to the true posterior

  • Think of it as a more flexible version of the MAP-plus-Laplace-approximation approach

Variational Inference

Solution: Come up with an approximate distribution that can closely learn the form of \(P(\mathbf z_i | \mathbf x_i , \boldsymbol \Theta)\)

\[ Q(\mathbf z_i | \mathbf x_i , \boldsymbol \Phi) \approx P(\mathbf z_i | \mathbf x_i , \boldsymbol \Theta) \]

  • \(Q()\) should be a distribution that is easy to work with like the multivariate normal distribution.

  • We’ll see why this works in a second

  • Way more flexible than you might think

Variational Inference

Multiply/divide by \(Q()\) (because we can):

\[ P(\mathbf x_i | \boldsymbol \Theta) = \frac{P(\mathbf x_i | \mathbf z_i , \boldsymbol \Theta)P(\mathbf z_i)Q(\mathbf z_i | \mathbf x_i , \boldsymbol \Phi)}{P(\mathbf z_i | \mathbf x_i , \boldsymbol \Theta)Q(\mathbf z_i | \mathbf x_i , \boldsymbol \Phi)} \]

Take the log of both sides (and rearrange in a special way):

\[ \log P(\mathbf x_i | \boldsymbol \Theta) = \log P(\mathbf x_i | \mathbf z_i , \boldsymbol \Theta) - \log \frac{Q(\mathbf z_i | \mathbf x_i , \boldsymbol \Phi)}{P(\mathbf z_i)} + \log \frac{Q(\mathbf z_i | \mathbf x_i , \boldsymbol \Phi)}{P(\mathbf z_i | \mathbf x_i , \boldsymbol \Theta)} \]
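For clarity, this rearrangement is just the expanded log of the previous line with its five terms regrouped into the three ratios:

```latex
\begin{aligned}
\log P(\mathbf x_i \mid \boldsymbol \Theta)
  &= \log P(\mathbf x_i \mid \mathbf z_i, \boldsymbol \Theta)
   + \log P(\mathbf z_i)
   + \log Q(\mathbf z_i \mid \mathbf x_i, \boldsymbol \Phi) \\
  &\quad - \log P(\mathbf z_i \mid \mathbf x_i, \boldsymbol \Theta)
   - \log Q(\mathbf z_i \mid \mathbf x_i, \boldsymbol \Phi) \\
  &= \log P(\mathbf x_i \mid \mathbf z_i, \boldsymbol \Theta)
   - \log \frac{Q(\mathbf z_i \mid \mathbf x_i, \boldsymbol \Phi)}{P(\mathbf z_i)}
   + \log \frac{Q(\mathbf z_i \mid \mathbf x_i, \boldsymbol \Phi)}{P(\mathbf z_i \mid \mathbf x_i, \boldsymbol \Theta)}
\end{aligned}
```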

Variational Inference

Finally, note that:

\[ E_Q[\log P(\mathbf x_i | \boldsymbol \Theta)] = \log P(\mathbf x_i | \boldsymbol \Theta) \]

since the marginal likelihood doesn’t depend on \(Q\)

  • Find the expected value of the inner quantity with respect to the approximating distribution

Applying this expectation across our quantity:

\[ E_Q[\log P(\mathbf x_i | \mathbf z_i , \boldsymbol \Theta)] - E_Q\left[\log \frac{Q(\mathbf z_i | \mathbf x_i , \boldsymbol \Phi)}{P(\mathbf z_i)}\right] + E_Q \left[ \log \frac{Q(\mathbf z_i | \mathbf x_i , \boldsymbol \Phi)}{P(\mathbf z_i | \mathbf x_i , \boldsymbol \Theta)}\right] \]

Variational Inference

Goal - maximize this quantity:

\[ E_Q[\log P(\mathbf x_i | \mathbf z_i , \boldsymbol \Theta)] - E_Q\left[\log \frac{Q(\mathbf z_i | \mathbf x_i , \boldsymbol \Phi)}{P(\mathbf z_i)}\right] + E_Q \left[ \log \frac{Q(\mathbf z_i | \mathbf x_i , \boldsymbol \Phi)}{P(\mathbf z_i | \mathbf x_i , \boldsymbol \Theta)}\right] \]

Variational Inference

\[ E_Q[\log P(\mathbf x_i | \mathbf z_i , \boldsymbol \Theta)] \]

is a measure of the expected reconstruction error w.r.t. our approximation

Assuming \(P()\) is a normal distribution like in factor analysis:

\[ \propto \exp\left[-\frac{1}{2} (\mathbf x_i - f(\mathbf z_i; \Theta))^T \boldsymbol \Psi^{-1} (\mathbf x_i - f(\mathbf z_i; \Theta)) \right] \]

we’re getting a squared difference between the input and the reconstructed input

  • A small difference maps to high probability since the normal density peaks at its mean

The maximum of the quantity is achieved when the reconstruction error is lowest!
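A quick pure-Python check of that claim, assuming isotropic noise \(\boldsymbol \Psi = \sigma^2 \mathbf I\) (an assumption made here for illustration): the Gaussian negative log-likelihood is squared reconstruction error plus a constant.

```python
# -log N(x | x_hat, sigma^2 I) = const + (1 / 2 sigma^2) * ||x - x_hat||^2
import math

def neg_log_gauss(x, x_hat, sigma2):
    P = len(x)
    sq = sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat))
    return 0.5 * P * math.log(2 * math.pi * sigma2) + 0.5 * sq / sigma2

x = [1.0, 2.0, 3.0]
const = neg_log_gauss(x, x, 1.0)               # squared-error term is zero here
nll = neg_log_gauss(x, [1.5, 2.0, 3.0], 1.0)
print(nll - const)                             # 0.125 = 0.5 * (0.5)^2
```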

Variational Inference

The second quantity is a special one called the KL Divergence

\[ D_{KL}(Q(\mathbf z_i | \mathbf x_i) || P(\mathbf z_i)) = \int Q(\mathbf z_i | \mathbf x_i , \boldsymbol \Phi) \log \frac{Q(\mathbf z_i | \mathbf x_i , \boldsymbol \Phi)}{P(\mathbf z_i)} d\mathbf z_i \]

The KL divergence is a measure of “distance” between two distributions

  • Always greater than or equal to zero (see Gibbs’ inequality)

  • Only zero when \(Q = P\) for all values of \(\mathbf z_i\)

Variational Inference

In general, the KL divergence is really hard to compute

  • It’s an expectation over the log ratio of two unknown distributions

In the special case where \(Q\) and \(P\) are \(K\) dimensional multivariate normal distributions, there is a closed form expression for the KL divergence:

\[ Q \sim \mathcal N(\boldsymbol \mu_0 , \boldsymbol \Sigma_0) \text{ ; } P \sim \mathcal N(\boldsymbol \mu_1 , \boldsymbol \Sigma_1) \]

\[ D_{KL}(Q || P) = \]

\[ \frac{1}{2} \left( \text{tr}\left(\boldsymbol \Sigma^{-1}_1 \boldsymbol \Sigma_0 \right) - K + (\boldsymbol \mu_1 - \boldsymbol \mu_0)^T \boldsymbol \Sigma^{-1}_1 (\boldsymbol \mu_1 - \boldsymbol \mu_0) + \log \left(\frac{\text{det} \boldsymbol \Sigma_1}{\text{det} \boldsymbol \Sigma_0}\right) \right) \]
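The closed form is easy to write out in NumPy (note the log-determinant ratio is \(\det \boldsymbol \Sigma_1 / \det \boldsymbol \Sigma_0\)):

```python
# Closed-form KL divergence between two K-dimensional multivariate normals.
import numpy as np

def kl_mvn(mu0, S0, mu1, S1):
    """D_KL( N(mu0, S0) || N(mu1, S1) )."""
    K = mu0.shape[0]
    S1_inv = np.linalg.inv(S1)
    d = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0) - K
                  + d @ S1_inv @ d
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

mu = np.zeros(3)
S = np.eye(3)
print(kl_mvn(mu, S, mu, S))         # 0.0: identical distributions
print(kl_mvn(mu + 1.0, S, mu, S))   # 1.5: a unit mean shift in each of 3 dims
```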

Variational Inference

Let’s look at a figure in the notebook.

\[ E_Q[\log P(\mathbf x_i | \mathbf z_i , \boldsymbol \Theta)] - D_{KL}(Q(\mathbf z_i | \mathbf x_i) || P(\mathbf z_i)) + D_{KL}(Q(\mathbf z_i | \mathbf x_i) || P(\mathbf z_i | \mathbf x_i)) \]

  • The first KL divergence can be computed if we have a multivariate normal approximation and a multivariate normal prior - both are specified at the outset

Our goal is to maximize this quantity

  • Maximize the first term

  • Since the KL divergence is nonnegative, minimize the second term

  • Minimize the difference between the proposal and the prior!

Variational Inference

\[ E_Q[\log P(\mathbf x_i | \mathbf z_i , \boldsymbol \Theta)] - D_{KL}(Q(\mathbf z_i | \mathbf x_i) || P(\mathbf z_i)) + D_{KL}(Q(\mathbf z_i | \mathbf x_i) || P(\mathbf z_i | \mathbf x_i)) \]

The second KL divergence is the distance between the proposal and the unknown conditional

  • We don’t know what \(P(\mathbf z_i | \mathbf x_i , \boldsymbol \Theta)\) is…

  • If our proposal is the same as the conditional, then the KL is zero

The two KL terms together control the trade-off between staying close to the prior and matching the true posterior

  • Some combination of the two is ideal.

Variational Inference

Unfortunately, we can’t compute this last term!

  • But, it is always greater than or equal to zero

The evidence lower bound (ELBo):

\[ \log P(\mathbf x_i | \boldsymbol \Theta) \ge E_Q[\log P(\mathbf x_i | \mathbf z_i , \boldsymbol \Theta)] - D_{KL}(Q(\mathbf z_i | \mathbf x_i) || P(\mathbf z_i)) \]

  • Optimize this quantity which is hopefully close to the true value!

  • In practice, if the resulting posterior is approximately normal, then it’s pretty good!

  • The Bayesian CLT (the Bernstein–von Mises theorem) states that as \(N \to \infty\), posteriors converge to multivariate normals
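The bound can be verified numerically in a toy one-dimensional model where everything is tractable (the model choice here is an illustrative assumption): \(z \sim \mathcal N(0,1)\), \(x \mid z \sim \mathcal N(z, 1)\), so the marginal is \(x \sim \mathcal N(0, 2)\) and the true posterior is \(z \mid x \sim \mathcal N(x/2, 1/2)\).

```python
# ELBO for a conjugate 1-D model, computed in closed form with stdlib math.
import math

def elbo(x, mu_q, var_q):
    # E_Q[log N(x | z, 1)] with Q = N(mu_q, var_q)
    recon = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - mu_q) ** 2 + var_q)
    # D_KL( N(mu_q, var_q) || N(0, 1) )
    kl = 0.5 * (var_q + mu_q ** 2 - 1 - math.log(var_q))
    return recon - kl

x = 1.0
log_evidence = -0.5 * math.log(2 * math.pi * 2) - x ** 2 / 4  # log N(x | 0, 2)

tight = elbo(x, mu_q=x / 2, var_q=0.5)  # Q = true posterior: bound is tight
loose = elbo(x, mu_q=0.0, var_q=1.0)    # Q = prior: strictly below the evidence
print(log_evidence, tight, loose)
```

When \(Q\) equals the true posterior the third (uncomputable) KL term is zero, so the ELBo equals the log evidence exactly; any other \(Q\) gives a strictly smaller value.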

Variational Inference

This generic distributional optimization strategy is called variational inference

  • Learn \(Q_i(\mathbf z_i | \mathbf x_i)\) for each observation that maximizes the log marginal likelihood.

In a second, we’re going to see an alteration of this method called amortized variational inference

  • Don’t learn different distributions - just learn a mapping of the input to the moments of the variational distributions!

  • Way faster.

Variational Autoencoders

For the generic latent variable model:

\[ P(\mathbf z) \sim \mathcal N_K(\mathbf 0 , \mathcal I_K) \]

\[ P(\mathbf x | \mathbf z) \sim \mathcal N_P(f(\mathbf z), \boldsymbol \Sigma_x) \]

\[ Q(\mathbf z | \mathbf x) \sim \mathcal N_K(g(\mathbf x), \boldsymbol \Sigma_z) \]

Find values for parameters that minimize the negative variational lower bound:

\[ -E_Q[\log P(\mathbf x | \mathbf z )] + D_{KL}(Q(\mathbf z_i | \mathbf x_i) || P(\mathbf z_i)) \]

Variational Autoencoders

Factor analysis is a special case of this model where we actually know \(P(\mathbf z_i | \mathbf x_i)\):

\[ P(\mathbf z) \sim \mathcal N_K(\mathbf 0 , \mathcal I_K) \]

\[ P(\mathbf x | \mathbf z) \sim \mathcal N_P(\mathbf W \mathbf z, \boldsymbol \Psi) \]

\[ P(\mathbf z | \mathbf x) \sim \mathcal N_K(\mathbf W^T (\mathbf W \mathbf W^T + \boldsymbol \Psi)^{-1}\mathbf x, \mathcal I_K - \mathbf W^T(\mathbf W \mathbf W^T + \boldsymbol \Psi)^{-1} \mathbf W) \]
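A NumPy sanity check of the posterior-mean matrix: by a push-through (Woodbury-style) identity, not shown on the slide, \(\mathbf W^T(\mathbf W \mathbf W^T + \boldsymbol \Psi)^{-1} = (\mathcal I_K + \mathbf W^T \boldsymbol \Psi^{-1} \mathbf W)^{-1} \mathbf W^T \boldsymbol \Psi^{-1}\), which only inverts a \(K \times K\) matrix.

```python
# Check the two equivalent forms of the factor-analysis posterior-mean matrix.
import numpy as np

rng = np.random.default_rng(1)
P, K = 6, 2
W = rng.normal(size=(P, K))
Psi = np.diag(rng.uniform(0.5, 2.0, size=P))  # diagonal noise covariance

A = W.T @ np.linalg.inv(W @ W.T + Psi)                    # P x P inverse
B = (np.linalg.inv(np.eye(K) + W.T @ np.linalg.inv(Psi) @ W)
     @ W.T @ np.linalg.inv(Psi))                          # K x K inverse

print(np.allclose(A, B))  # True
```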

Variational Autoencoders

Note that this construction is agnostic about the form of the mappings: from \(\mathbf x\) to the mean and covariance of \(Q()\), and from \(\mathbf z\) to the moments of \(P()\)

Like deterministic autoencoders, we can replace the linear mapping with an arbitrary function learned via a deep model

  • The in-between parts can be FCNN and/or CNN backbones!

  • No different than deterministic autoencoders

  • Called amortized variational inference since we’re not learning posteriors, per se

  • Just learning functions

Main difference:

  • Learning distributions instead of point mappings to and from the latent space!

Variational Autoencoders

The generic routine:

  • Specify a decoder likelihood w.r.t. the input - \(P(\mathbf x | f(\mathbf z))\)

  • Specify a prior on the latent variables in \(K\) dimensions - \(P(\mathbf z)\)

  • Specify an approximate posterior over the latents - \(Q(\mathbf z | g(\mathbf x))\)

Learn \(g()\) and \(f()\) that maximize the ELBo:

\[ E_Q[\log P(\mathbf x_i | \mathbf z_i , \boldsymbol \Theta)] - D_{KL}(Q(\mathbf z_i | \mathbf x_i) || P(\mathbf z_i)) \]

Variational Autoencoders

Why is this helpful?

Let’s draw a picture on the board.

Main point:

  • We’re creating a mapping from a point input to a distribution in the latent space

  • Mapping the latent distribution to another distribution over reconstruction candidates

  • Fills in the gaps in a way that deterministic autoencoders do not!

Variational Autoencoders

Some practical considerations:

  • You’ll often just skip the second distributional draw and allow all uncertainty to be propagated upwards from the latent distribution. Doesn’t make too much of a difference in the final model.

  • We’ll almost always want to restrict the covariance matrix for the mapping of \(\mathbf x\) to \(\mathbf z\) to be diagonal. This makes the latent space orthogonal and allows easy separation of sources of variation

  • You’ll almost always want to make \(Q(\mathbf z | \mathbf x)\) and \(P(\mathbf x | \mathbf z)\) multivariate normal distributions. Similarly, you’ll want to make your prior over the latent space multivariate normal with mean 0 and identity covariance. This makes the computations of KL divergence needed tractable.

Variational Autoencoders

With a standard normal prior in \(K\) dimensions and a diagonal normal proposal in \(K\) dimensions, there is a simple form for the KL divergence

\[ D_{KL}(Q(\mathbf z_i | \mathbf x_i) || P(\mathbf z_i)) = \frac{1}{2} \sum \limits_{k = 1}^K \left[\sigma^2_k + \mu^2_k - 1 - \log(\sigma^2_k)\right] \]
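This is the per-observation KL penalty used in practice; written out in pure Python:

```python
# KL( N(mu, diag(sigma2)) || N(0, I) ) from the slide's closed form.
import math

def kl_diag_std(mu, sigma2):
    return 0.5 * sum(s + m ** 2 - 1 - math.log(s) for m, s in zip(mu, sigma2))

print(kl_diag_std([0.0, 0.0], [1.0, 1.0]))  # 0.0: proposal equals the prior
print(kl_diag_std([1.0, 0.0], [1.0, 1.0]))  # 0.5: a unit mean shift in one dim
```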

Variational Autoencoders

Let’s go through an example VAE in PyTorch.
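The notebook itself isn't reproduced here, but a minimal sketch of the kind of model it walks through looks like this (layer sizes and dimensions are illustrative assumptions). The encoder outputs the moments of \(Q(\mathbf z | \mathbf x)\), and the reparameterization trick \(z = \mu + \sigma \epsilon\) keeps the sampling step differentiable.

```python
# Minimal VAE sketch: Gaussian encoder Q(z|x), Gaussian decoder, negative ELBO loss.
import torch
import torch.nn as nn

P, K = 784, 8

class VAE(nn.Module):
    def __init__(self, p=P, k=K, h=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(p, h), nn.ReLU())
        self.mu = nn.Linear(h, k)       # mean of Q(z | x)
        self.logvar = nn.Linear(h, k)   # log-variance of Q(z | x), diagonal
        self.dec = nn.Sequential(nn.Linear(k, h), nn.ReLU(), nn.Linear(h, p))

    def forward(self, x):
        hidden = self.enc(x)
        mu, logvar = self.mu(hidden), self.logvar(hidden)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterize
        return self.dec(z), mu, logvar

def neg_elbo(x, x_hat, mu, logvar):
    recon = 0.5 * ((x - x_hat) ** 2).sum(dim=1)                  # Gaussian NLL up to constants
    kl = 0.5 * (logvar.exp() + mu ** 2 - 1 - logvar).sum(dim=1)  # closed-form KL vs N(0, I)
    return (recon + kl).mean()

model = VAE()
x = torch.randn(32, P)  # a fake minibatch for shape-checking
x_hat, mu, logvar = model(x)
loss = neg_elbo(x, x_hat, mu, logvar)
```

Training is then ordinary minibatch gradient descent on `loss`; sampling new data just draws \(\mathbf z \sim \mathcal N(\mathbf 0, \mathcal I_K)\) and runs the decoder.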

Variational Autoencoders

VAEs produce way more coherent generated images than other latent variable methods!

  • Can sample new images from the prior

Next time, we’ll briefly touch on two things with VAEs

  • Editing images

  • \(\beta\)-VAEs

Then, we’ll start our discussion of normalizing flow models